preference-based reinforcement learning
Meta-Reward-Net: Implicitly Differentiable Reward Learning for Preference-based Reinforcement Learning
Setting up a well-designed reward function has been challenging for many reinforcement learning applications. Preference-based reinforcement learning (PbRL) provides a new framework that avoids reward engineering by leveraging human preferences (e.g., preferring apples over oranges) as the reward signal. Therefore, improving the efficacy of data usage for preference data becomes critical. In this work, we propose Meta-Reward-Net (MRN), a data-efficient PbRL framework that incorporates bi-level optimization for both reward and policy learning. The key idea of MRN is to adopt the performance of the Q-function as the learning target. Based on this, MRN learns the Q-function and the policy in the inner level while updating the reward function adaptively according to the performance of the Q-function on the preference data in the outer level. Our experiments on simulated robotic manipulation tasks and locomotion tasks demonstrate that MRN outperforms prior methods when few preference labels are available and significantly improves data efficiency, achieving state-of-the-art performance in preference-based RL. Ablation studies further demonstrate that MRN learns a more accurate Q-function than prior work and shows clear advantages when only a small amount of human feedback is available. The source code and videos of this project are released at https://sites.google.com/view/meta-reward-net.
Preference-based Reinforcement Learning with Finite-Time Guarantees
Preference-based Reinforcement Learning (PbRL) replaces reward values in traditional reinforcement learning by preferences to better elicit human opinion on the target objective, especially when numerical reward values are hard to design or interpret. Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy. In this paper, we present the first finite-time analysis for general PbRL problems. We first show that a unique optimal policy may not exist if preferences over trajectories are deterministic for PbRL. If preferences are stochastic, and the preference probability relates to the hidden reward values, we present algorithms for PbRL, both with and without a simulator, that are able to identify the best policy up to accuracy $\varepsilon$ with high probability. Our method explores the state space by navigating to under-explored states, and solves PbRL using a combination of dueling bandits and policy search. Experiments show the efficacy of our method when it is applied to real-world problems.
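The stochastic preference setting described above, in which the preference probability depends on hidden reward values, can be illustrated with a minimal sketch. It assumes a Bradley-Terry style logistic link on the difference of summed segment rewards, which is one common choice in PbRL and not necessarily the exact model analysed in the paper. The function names `preference_probability` and `sample_preference` are hypothetical and introduced only for illustration.

```python
import math
import random


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Assumption: a Bradley-Terry style model, i.e. a logistic function of
    the difference between the summed (hidden) rewards of the segments.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def sample_preference(rewards_a, rewards_b, rng):
    """Draw a noisy label: 0 if A is preferred, 1 if B is preferred."""
    p_a = preference_probability(rewards_a, rewards_b)
    return 0 if rng.random() < p_a else 1


if __name__ == "__main__":
    rng = random.Random(0)
    seg_a, seg_b = [1.0, 1.0], [0.5, 0.5]
    labels = [sample_preference(seg_a, seg_b, rng) for _ in range(1000)]
    print("P(A preferred):", preference_probability(seg_a, seg_b))
    print("Empirical frequency of A:", labels.count(0) / len(labels))
```

Under this assumed model, two segments with equal return are preferred with probability 0.5 each, and labels become nearly deterministic as the return gap grows.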
SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning
Ni, Hexian, Lu, Tao, Hu, Haoyuan, Cai, Yinghao, Wang, Shuo
Preference-based Reinforcement Learning (PbRL) methods avoid reward engineering by learning reward models from human preferences. However, poor feedback efficiency and sample efficiency remain problems that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful and easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We design a Motion-Distinction-based Selection scheme (MDS). It selects segment pairs with apparent motion and different directions through kernel density estimation of states, which is more task-related and makes the pairs easier for humans to label; (2) We propose a novel preference-guided exploration method (PGE). It encourages exploration towards states with high preference and low visitation and continuously guides the agent towards valuable samples. The synergy between the two mechanisms can significantly accelerate the progress of reward and policy learning. Our experiments show that SENIOR outperforms five other existing methods in both human feedback efficiency and policy convergence speed on six complex robot manipulation tasks in simulation and four real-world tasks.
PB$^2$: Preference Space Exploration via Population-Based Methods in Preference-Based Reinforcement Learning
Driss, Brahim, Davey, Alex, Akrour, Riad
Preference-based reinforcement learning (PbRL) has emerged as a promising approach for learning behaviors from human feedback without predefined reward functions. However, current PbRL methods face a critical challenge in effectively exploring the preference space, often converging prematurely to suboptimal policies that satisfy only a narrow subset of human preferences. In this work, we identify and address this preference exploration problem through population-based methods. We demonstrate that maintaining a diverse population of agents enables more comprehensive exploration of the preference landscape compared to single-agent approaches. Crucially, this diversity improves reward model learning by generating preference queries with clearly distinguishable behaviors, a key factor in real-world scenarios where humans must easily differentiate between options to provide meaningful feedback. Our experiments reveal that current methods may fail by getting stuck in local optima, requiring excessive feedback, or degrading significantly when human evaluators make errors on similar trajectories, a realistic scenario often overlooked by methods relying on perfect oracle teachers. Our population-based approach demonstrates robust performance when teachers mislabel similar trajectory segments and shows significantly enhanced preference exploration capabilities, particularly in environments with complex reward landscapes.
Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
Text-to-image generative models have recently attracted considerable interest, enabling the synthesis of high-quality images from textual prompts. However, these models often lack the capability to generate specific subjects from given reference images or to synthesize novel renditions under varying conditions. Methods like DreamBooth and Subject-driven Text-to-Image (SuTI) have made significant progress in this area. Yet, both approaches primarily focus on enhancing similarity to reference images and require expensive setups, often overlooking the need for efficient training and avoiding overfitting to the reference images. In this work, we present the $\lambda$-Harmonic reward function, which provides a reliable reward signal and enables early stopping for faster training and effective regularization.
Review for NeurIPS paper: Preference-based Reinforcement Learning with Finite-Time Guarantees
This paper generated considerable discussion among the reviewers. On the positive side, this paper makes a solid contribution to the emerging literature on preference-based RL, a topic of some importance, and offers some interesting insights (e.g., on the potential lack of a "winning policy") and novel algorithmic contributions. Conversely, some reviewers raised issues with some of the assumptions made in the paper and with the presentation (which seems to assume familiarity with PBRL and its motivations/rationale). The author response was thoughtful and generated some discussion (some of which is not reflected in the reviews, a couple of which unfortunately failed to get updated). On my own reading of the paper, I agree that the paper makes a useful contribution to PBRL, especially from a technical and conceptual perspective (although I don't believe it makes PBRL more practical at this stage).
Review for NeurIPS paper: Preference-based Reinforcement Learning with Finite-Time Guarantees
Weaknesses: There are two main weaknesses. First, I'm not sure whether the algorithm is meant to be the core contribution, or the analysis. If it's the algorithm, then the paper needs to actually test the algorithm in more than toy settings (and ideally with real humans, rather than simulating answers with BLT with two parameter settings). But if it's the analysis, I almost feel like the experiments are distracting, or at least overstating and drawing away from the main contributions. I'd love to hear the authors' perspective on this, but my suggestion would be to either a) get the best of both worlds by running a more serious experiment, or b) edit the paper to highlight the analysis and justify the experiments as showing what the algorithm does empirically and perhaps aiding with some qualitative analysis of the resulting behavior when applied to simple tasks, aiding in the understanding of the algorithm.
Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model
Tu, Songjun, Sun, Jingbo, Zhang, Qichao, Lan, Xiangyuan, Zhao, Dongbin
Preference-based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences. However, real-time human feedback is hard to obtain in online tasks. Most work supposes there is a "scripted teacher" that utilizes a privileged predefined reward to provide preference feedback. In this paper, we propose an RL Self-augmented Large Language Model Feedback (RL-SaLLM-F) technique that does not rely on privileged information for online PbRL. RL-SaLLM-F leverages the reflective and discriminative capabilities of an LLM to generate self-augmented trajectories and provide preference labels for reward learning. First, we identify a failure issue in LLM-based preference discrimination, specifically "query ambiguity", in online PbRL. Then the LLM is employed to provide preference labels and generate self-augmented imagined trajectories that better achieve the task goal, thereby enhancing the quality and efficiency of feedback. Additionally, a double-check mechanism is introduced to mitigate randomness in the preference labels, improving the reliability of LLM feedback. Experiments across multiple tasks in the MetaWorld benchmark demonstrate the specific contributions of each proposed module in RL-SaLLM-F and show that self-augmented LLM feedback can effectively replace the impractical "scripted teacher" feedback. In summary, RL-SaLLM-F introduces a new direction of feedback acquisition in online PbRL that does not rely on any online privileged information, offering an efficient and lightweight solution with LLM-driven feedback.
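The abstract above does not spell out how the double-check mechanism works. One plausible reading, sketched here purely as an assumption, is to query the labeler twice with the two trajectories in swapped order and keep the label only when both answers agree. The `Labeler` type, the function `double_checked_label`, and the two toy labelers are hypothetical stand-ins for an LLM call and are not part of RL-SaLLM-F.

```python
from typing import Callable, Optional, Sequence

# A labeler takes two trajectories and returns 0 if the first is preferred,
# 1 if the second is preferred. In practice this would wrap an LLM query.
Labeler = Callable[[Sequence[float], Sequence[float]], int]


def double_checked_label(
    labeler: Labeler, traj_a: Sequence[float], traj_b: Sequence[float]
) -> Optional[int]:
    """Query twice with swapped order; keep the label only if consistent.

    Assumption: consistency under order swapping is used as the check.
    Returns 0 (A preferred), 1 (B preferred), or None (discard the query).
    """
    first = labeler(traj_a, traj_b)
    swapped = labeler(traj_b, traj_a)
    # Map the swapped answer back into the original (A, B) ordering.
    second = 1 - swapped
    return first if first == second else None


def return_based_labeler(x: Sequence[float], y: Sequence[float]) -> int:
    """Toy consistent labeler: prefers the trajectory with the higher sum."""
    return 0 if sum(x) > sum(y) else 1


def position_biased_labeler(x: Sequence[float], y: Sequence[float]) -> int:
    """Toy unreliable labeler: always prefers whichever comes first."""
    return 0


if __name__ == "__main__":
    a, b = [1.0, 2.0], [0.5, 0.5]
    print(double_checked_label(return_based_labeler, a, b))
    print(double_checked_label(position_biased_labeler, a, b))
```

In this sketch a labeler whose answer depends only on presentation order is always filtered out, while a labeler that tracks the content of the trajectories keeps its label.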